Skip to content

Comments

otel: add new module#61907

Open
bengl wants to merge 1 commit intonodejs:mainfrom
bengl:bengl/otel
Open

otel: add new module#61907
bengl wants to merge 1 commit intonodejs:mainfrom
bengl:bengl/otel

Conversation

@bengl
Copy link
Member

@bengl bengl commented Feb 20, 2026

Adds support for OpenTelemetry tracing, as described in the included docs. Custom instrumentation is not yet supported.

This PR is intended to lay the groundwork for future PRs where API support for custom instrumentations is added. The OpenTelemetry SDK is explicitly not used, mainly to avoid bloat.

A programmatic API to enable and disable tracing is added as a new module, node:otel, but I'm very open to suggestions in terms of where to add it instead, keeping in mind that the intent is to eventually expose an API for custom instrumentation alongside it.

Disclaimer: As an experiment, Claude Code was used to build this PR, based on a hand-written prototype I built (but did not publish) about a year ago. Everything was thoroughly reviewed by me prior to PR submission, including some hand-edits.

Adds support for OpenTelemetry tracing, as described in the included
docs.

Custom instrumentation is not yet supported.
@nodejs-github-bot
Copy link
Collaborator

Review requested:

  • @nodejs/config
  • @nodejs/loaders
  • @nodejs/startup

@nodejs-github-bot nodejs-github-bot added lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run. labels Feb 20, 2026
Copy link
Member

@legendecas legendecas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT, it would create confusion that the built-in support does not work with the @opentelemetry/api package, which is intended to be the global provider for otel components, like propagators and exporters.

@mcollina
Copy link
Member

I personally find the opentelemetry apis really cumbersome to work with. Maybe there is something better that can be done.

But I also concur that some kind of high level compatibility api would need to exist.

@ljharb
Copy link
Member

ljharb commented Feb 20, 2026

It would be really nice to get ahold of https://www.npmjs.com/package/otel and add this new core module without the prefix.

otel.start({ endpoint: 'http://collector.example.com:4318' });

// ... application code ...
otel.stop();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More a style nit... feel free to ignore. I find this start()/stop() pattern a bit cumbersome. I'd be happier something like...

const stop = otel.start();

stop();

which also allows for...

{
  using session = otel.start();
  // automatically stops
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While using is great, I don't think it's actually appropriate here. This is for enable and disabling an entire subsystem. stop() would probably actually be pretty rare, probably only used for some kind of error recovery / process-teardown. It definitely doesn't fit the usual use case for using, where we'd expect a resource to be disposed at the end of a variable scope.

* `options` {Object}
* `endpoint` {string} The OTLP collector endpoint URL.
**Default:** `'http://localhost:4318'`.
* `filter` {string|string\[]} Optional comma-separated string or array of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we pick one? I'd prefer just string[].

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the way it is for consistency with the environment variable. It could be an array here in the API and then comma-separated in the environment variable.

@legendecas
Copy link
Member

legendecas commented Feb 20, 2026

This PR is intended to lay the groundwork for future PRs where API support for custom instrumentations is added. The OpenTelemetry SDK is explicitly not used, mainly to avoid bloat.
keeping in mind that the intent is to eventually expose an API for custom instrumentation alongside it.

Agree that an OpenTelemetry SDK is not built-in to the Node.js instrumentation. However, this implementation is not fully compliant with the OpenTelemetry design. The instrumentation should not implement the OpenTelemetry SDK components, like exporters and tracer SDK.

The three OpenTelemetry layers, API, SDK, and instrumentations, have their clear scopes and allows open customization without any vendor lock-in. That is, instrumentations will use the API to create spans and metrics, but do not directly depend on the SDK. This allows instrumentations to be vendor-agnostic. And observability vendors could provide their SDK (or just an exporter) to do whatever propriatary works, like exporting.

I think a Node.js built-in otel module should provide OpenTelemetry specification compliant API and implementations. A built-in OpenTelemetry should not split the OpenTelemetry community.

Copy link
Member

@Qard Qard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small questions/ideas, but otherwise LGTM!

### `OTEL_SERVICE_NAME`

Standard OpenTelemetry environment variable used to set the service name in
exported resource attributes. Defaults to `node-<pid>`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have to load the package.json during startup, don't we? So perhaps we could make this use the name field from it, if present, and fallback to that?

Comment on lines +39 to +54
let timeOriginNs;
function getTimeOriginNs() {
if (timeOriginNs === undefined) {
// getTimeOriginTimestamp() returns ms since Unix epoch.
// Multiply by 1e6 to get nanoseconds.
const originMs = getTimeOriginTimestamp();
timeOriginNs = BigInt(MathRound(originMs * 1e6));
}
return timeOriginNs;
}

function hrTimeToNanos() {
const relativeMs = now();
const ns = getTimeOriginNs() + BigInt(MathRound(relativeMs * 1e6));
return `${ns}`;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering a bit about the perf of this. Did you try a high/low pair like process.hrtime() returns to be able to do the 64-bit timestamp rather than this string approach? I'm a bit worried about the perf implications of lots of bigint timestamps being generated all the time. 🤔

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OTEL hrtime, as far as I have been able to determine, is not compatible with process.hrtime. It is based upon performance.timeOrigin -- https://github.com/newrelic/node-newrelic/blob/39d1d9a59ca467d71d4a7db4cf437d743a24939b/lib/otel/normalize-timestamp.js#L56-L85

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from performance concerns, is it a good idea to depend entirely on a nonmonotonic clock? This may result in negative duration or out of order spans.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, shouldn't it be BigInt(MathRound(originMs)) * 1_000_000n to avoid loss of precision?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mean using process.hrtime itself, I mean using the pattern of separating high/low parts up to the encoding edge so we don't need to be operating on bigints. The public API surface could also convert to whatever types we need, if we want to duplicate official OpenTelemetry behaviour exactly. I'm just thinking the internals could be a lot better about perf.

@codecov
Copy link

codecov bot commented Feb 20, 2026

Codecov Report

❌ Patch coverage is 90.18036% with 98 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.75%. Comparing base (9a237cd) to head (ae2b9e3).
⚠️ Report is 46 commits behind head on main.

Files with missing lines Patch % Lines
lib/internal/otel/instrumentations.js 78.75% 50 Missing and 1 partial ⚠️
lib/internal/otel/flush.js 88.48% 34 Missing and 1 partial ⚠️
lib/internal/otel/span.js 93.71% 10 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #61907      +/-   ##
==========================================
+ Coverage   89.73%   89.75%   +0.01%     
==========================================
  Files         675      680       +5     
  Lines      204674   205883    +1209     
  Branches    39330    39565     +235     
==========================================
+ Hits       183667   184781    +1114     
- Misses      13283    13387     +104     
+ Partials     7724     7715       -9     
Files with missing lines Coverage Δ
lib/internal/bootstrap/realm.js 96.21% <100.00%> (+0.21%) ⬆️
lib/internal/otel/core.js 100.00% <100.00%> (ø)
lib/internal/otel/id.js 100.00% <100.00%> (ø)
lib/internal/process/pre_execution.js 97.63% <100.00%> (+1.66%) ⬆️
lib/otel.js 100.00% <100.00%> (ø)
src/node_options.cc 76.30% <100.00%> (+0.02%) ⬆️
src/node_options.h 97.90% <100.00%> (+0.01%) ⬆️
lib/internal/otel/span.js 93.71% <93.71%> (ø)
lib/internal/otel/flush.js 88.48% <88.48%> (ø)
lib/internal/otel/instrumentations.js 78.75% <78.75%> (ø)

... and 56 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

urlStr + '/v1/traces';
}

const parsed = new URL(urlStr);
Copy link

@SerenModz21 SerenModz21 Feb 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition could be simplified to the following:

const parsed = new URL('/v1/traces', endpoint);

Tests with the Node REPL:

> new URL("/v1/traces", "http://localhost:4318/hello?foo=bar").href
'http://localhost:4318/v1/traces'
> new URL("/v1/traces", "http://localhost:4318/").href
'http://localhost:4318/v1/traces'
> new URL("/v1/traces", "http://localhost:4318").href
'http://localhost:4318/v1/traces'
> new URL("/v1/traces", "http://localhost:4318/v1").href
'http://localhost:4318/v1/traces'
> new URL("/v1/traces", "http://localhost:4318/v1/traces").href
'http://localhost:4318/v1/traces'
> new URL("/v1/traces", "http://localhost:4318/v1/traces/").href
'http://localhost:4318/v1/traces'

@dyladan
Copy link

dyladan commented Feb 21, 2026

I have a lot of thoughts about this so I'll try to keep it succinct and direct. In the interest of full disclosure, I have been one of the maintainers of OTel JS for years and that will likely color some of my opinions. I am not on the OTel Governance Committee and the opinions here should be taken as my own and not as the official stance of the project.

I'll start with the things I like. First, I think having built-in support for OTel is great, and focusing on HTTP first makes sense. The lack of interoperability with existing instrumentation means the use of this will be very limited for now, but that's probably fine. Anyone who needs other instrumentation has the option to use the existing otel libraries. Second, i think it is reasonable to only support OTLP. Supporting other export formats is a well intentioned part of OTel, but it was a lot more important in the earlier days when OTLP support was limited. Now all major vendors support OTLP, and the collector serves as a mature routing layer for people who need to use other formats. I would prefer to have OTLP protobuf as an option since it is by far the most used OTLP format, but JSON makes sense as the de facto standard format for JS remote calls and I understand the desire to avoid the protobuf dependency. Third, I do see the value in a reimplementation which is node-specific and avoids the need for various platform-specific workarounds. Because this is shipped with node itself, there is no need to retain workarounds for old node versions or non-node platforms like browsers. Finally, this seems really easy to use. The lack of configuration, customization, and manual instrumentation definitely makes this simpler to understand and use for end users, and because it is built in there is no need to make sure your otel dependency versions align with each other.

I do have a couple issues with the way this is implemented. This seems to have completely done away with most of the features required by the OTel specification. For example, it is missing all of the normal extension points (configuration, propagators, samplers, id generators, span processors, exporters, resource, entities). I'm not sure if this was an unintentional side-effect of vibe coding a replacement SDK or an intentional choice to "avoid bloat," but it really undercuts a lot of the value of OTel. I might even go so far as to say it's not quite accurate to call this "OpenTelemetry support" without those things. It might be better to call this "OTLP span export" or something like that. At the very least I'd make sure you reach out to the OTel Governance Committee to obtain permission to use the OpenTelemetry trademark. I'm not a lawyer and not on the GC, but this is an area where I think it pays to be careful.

Finally, and this is my biggest concern, I agree with @legendecas that this seems like it would fracture the otel ecosystem on node. I am most concerned about the complete lack of interoperability with existing otel libraries. It seems to me that there is definitely a way that this could be done that would retain interoperability and allow a gradual transition to a built in implementation without such a fracture. I understand that obviously node can't take a dependency on an NPM module, but I'm certain the otel JS SDK would not have any issue taking a dependency on built-in modules where the experience could be improved for users.

@bengl
Copy link
Member Author

bengl commented Feb 23, 2026

@legendecas @dyladan

The PR description probably could have used a bit more details. Sorry about that.

First, let me address compatibility/fracturing and support (or lack thereof) for @opentelemetry/api. This PR is intended as an initial phase, supporting OTLP export of internally-created spans for HTTP inbound and outbound only, for now. API support is intended to be handled in a subsequent PR, including support for Span creation/manipulation via @opentelemetry/api, similar to how it's done in Deno. Requiring the usage of a completely different API would not be in scope. Bottom line: fracturing is an anti-goal of this PR, and my intended subsequent PR to add API support, so I'm happy to make whatever changes here that will prevent fracturing and still enable good built-in support.

Second, I agree that calling this PR "OpenTelemetry support" could be misleading if users expect full API support. I'm happy to rename things as appropriate, or move the programmatic activation to some other module (though, I kept it as a new module here knowing that I'd eventually want to export a TracerProvider implementation, etc.). The implementation here is intended to be minimal to avoid bloat and improve performance, particularly for the minimal case it provides for. It can serve as a core for future support for the extension points mentioned via API support, but, of course, that isn't there yet.

The tracing subsystem automatically propagates [W3C Trace Context][] across HTTP
boundaries:

* **Incoming requests**: The `traceparent` header is read from incoming HTTP
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should either include tracestate handling otherwise it's not complete from W3C spec point of view.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With tracestate unused by this internally it may be as simple as propagating it with the span so it can be attached on client spans. No need to parse it for now.


This module is only available under the `node:` scheme. It requires the
`--experimental-otel` CLI flag or one of the `NODE_OTEL` / `NODE_OTEL_ENDPOINT`
environment variables.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should clearly mention the limitations of the current state like

  • incompatible with OTel components from the OpenTelemetry project
  • trace.id/span.id duplication
  • likely overwrite of W3C tracecontext headers in outgoing requests

I guess a user has to decide to use either the one or the other OTel variant and avoid to use both in the same node.js process.


> Stability: 1 - Experimental

When set to a non-empty value, enables the built-in OpenTelemetry tracing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really any value? e.g. having NODE_OTEL=disable or NODE_OTEL=0 to enable it seems a bit uncommon.

Quite some other node env vars check on the value 1.

Uint8Array,
} = primordials;

const { randomFillSync } = require('internal/crypto/random');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it required to use cryptographically strong pseudorandom here? Or is the perf impact don't care?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required to be crypto random by spec

}
}

function flush() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't flush return a promise to allow users to wait until all data is sent? e.g. before shutdown.

}

if (isModuleEnabled('node:http')) {
sub('http.server.request.start', onHttpServerRequestStart);
Copy link
Member

@IlyasShabi IlyasShabi Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to check is it's active and module enabled before subscribing


// Generate W3C traceparent header value.
// Format: {version}-{trace-id}-{span-id}-{trace-flags}
inject() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rename to getTraceParent(). This function doesn't inject anything as of now.

if (typeof value === 'boolean') {
return { boolValue: value };
}
return { stringValue: String(value) };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there intentionally no array support?

lastExportWarningTime = now;
const suffix = exportFailureCount > 1 ?
` (${exportFailureCount} total failures)` : '';
process.emitWarning(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe otel events should be emitted on the otel object instead process.


const transport = parsed.protocol === 'https:' ? https : http;

const req = transport.request({
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider to use an http agent to keep the connection alive for some time.

}
}

function sendToCollector(body) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no throttling, so if a lot spans are ended and/or connection is slow quite a lot parallel requests might be created.

Comment on lines +55 to +58
request[kSpan] = span;

const storage = getSpanStorage();
storage.enterWith(span);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do request[kSpan] if you're also using AsyncLocalStorage? Is this just a performance optimization to avoid calling ALS again on response finish?

@jsumners-nr
Copy link

The implementation here is intended to be minimal to avoid bloat and improve performance, particularly for the minimal case it provides for.

A very welcome goal. We are extremely interested in this moving forward. We haven't even technically released/announced our support for bridging OTEL signals into our agent, and we have already had complaints about the insane bloat the OTEL dependencies add as disk size, and complaints about the memory impacts of just loading them. But we really don't have the time to go off and re-implement the OTEL API+SDK just to elide all of the things the reference implementations bring in.

If we can set up a GitHub project of issues to resolve, that would make it far easier for those interested to contribute and get this feature fully implemented.

@legendecas
Copy link
Member

legendecas commented Feb 23, 2026

@dyladan: I understand that obviously node can't take a dependency on an NPM module, but I'm certain the otel JS SDK would not have any issue taking a dependency on built-in modules where the experience could be improved for users.

Actually, Node.js can take a dependency on an NPM module. There are precedents like amaro, minimatch, and undici.

@bengl: fracturing is an anti-goal of this PR, and my intended subsequent PR to add API support

Thank you for clarifying on this! I really appreciate it that fracturing is not the goal of this PR. It was really confusing that this PR did every OpenTelemetry API/SDK/Instrumentation from scratch but not complete, despite that there are official OpenTelemetry JS implementation.

The implementation here is intended to be minimal to avoid bloat and improve performance, particularly for the minimal case it provides for.

The official open-source OpenTelemetry JS impl is implementing the OpenTelemetry standard specification to support a variaty of data pipelines and configurations. If the goal of the Node.js OpenTelemetry support is to be fully OpenTelemetry spec compliant, it is really confusing that how this PR and the following-up work to "avoid bloat" in the end, essentially replicating the OpenTelemetry JS impl inside Node.js.

There are precedents that an npm package can be bundled in Node.js, and @opentelemetry/api would be a minimum dependency, as itself has zero dependency.

request[kSpan] = span;

const storage = getSpanStorage();
storage.enterWith(span);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this enterWith() leaks as there is no enterWith() using the previous store.

);

span.setAttribute('http.request.method', method);
span.setAttribute('server.address', request.origin || '');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better to set attribute only if we have an actual value instead sending an empty string.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure the empty string is consistent with the reference implementation. But it's too difficult to find to link to it.

Comment on lines +3557 to +3558
subsystem and directs spans to the specified OTLP/HTTP collector endpoint. The
`/v1/traces` path is appended automatically. Also makes the `node:otel` module

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a full URI with path is provided, then /v1/traces should not be appended.


const SPAN_KIND_INTERNAL = 1;
const SPAN_KIND_SERVER = 2;
const SPAN_KIND_CLIENT = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be helpful to have some link to the source of this magic constants.
I assume they can be found somewhere in OTLP spec.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

function onHttpClientRequestCreated({ request }) {
if (!isActive() || !isModuleEnabled('node:http')) return;

if (request.getHeader('host') === getCollectorHost()) return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should use a more fine grained exporter filter here as the same host might have more endpoints and some of them might be of interest.

@Qard
Copy link
Member

Qard commented Feb 23, 2026

I would prefer to have OTLP protobuf as an option since it is by far the most used OTLP format, but JSON makes sense as the de facto standard format for JS remote calls and I understand the desire to avoid the protobuf dependency.

I have an equivalent of https://github.com/datadog/pprof-format for OTLP that I'd like to add to this at some point, when I get to finishing it, so that may get resolved before too long. 🙂

This seems to have completely done away with most of the features required by the OTel specification. For example, it is missing all of the normal extension points (configuration, propagators, samplers, id generators, span processors, exporters, resource, entities). I'm not sure if this was an unintentional side-effect of vibe coding a replacement SDK or an intentional choice to "avoid bloat," but it really undercuts a lot of the value of OTel. I might even go so far as to say it's not quite accurate to call this "OpenTelemetry support" without those things.

In my opinion it's a feature not having all that. The majority of users don't actually need most of those things, and there's very real overhead to needing to delegate across this complicated cluster of types in hot code. A minimal API with sensible defaults makes a lot more sense to me.

Finally, and this is my biggest concern, I agree with @legendecas that this seems like it would fracture the otel ecosystem on node. I am most concerned about the complete lack of interoperability with existing otel libraries.

There's a sort of irony that the thing which purports to prevent vendor lock-in is so rigidly unable to allow using just a subset of its features.

In my opinion the full OpenTelemetry API exists at a level which caters to end-users, not to implementors. To me arguing that the OpenTelemetry API should exist in core is like arguing that the express router should exist in core--it is an opinionated user-facing API surface, not core infrastructure. It is the wrong level for the actual implementor concerns, this is what diagnostics_channel was created for.

Having some simplistic support for producing OTLP in core makes sense to me, but I personally don't actually see the value in having the entire API in core, especially when it already exists and is completely usable as-is in userland. I do respect that OpenTelemetry seems to have "won" the race and achieved critical mass in adoption, but I think just directly absorbing the decisions it has made is a mistake. We should be focusing on what parts of its internals suffer from the limitations of being in userland and fixing those problems more surgically.

Over time it's likely more of the infrastructure would accumulate in core and we may eventually end up with the whole API, but just pulling it all in at once is likely to just rot and takes the focus away from finding the actual problems which convince one that there is benefit to being in core.

I'm not generally one for the small core ethos at this point, it's good to have a full toolbox, but I would be hesitant to absorb such a strongly opinionated API without deeply considering why each decision was made and if that is a decision we want to be making in core.


I realize the message prior to this may sound a bit harsh, which I don't intend it to be. I would like to see at least OTLP export in core, and I very much understand the utility of the broader capability which OpenTelemetry provides. I just think we should exercise caution in pushing for the entire API rather than starting small and building on top with small building blocks. If we're going to adopt a significant new API surface that should involve significant engineering review and deep thinking on the implications.

I'm quite happy this went for the minimal surface area it did just to get a usable tool in there which can be expanded to enable greater extensibility later. I would very much like to see the scope of this not bloat significantly to try and cover every use case of OpenTelemetry. I would much prefer we take the time to learn and discover what actually needs to be in core to enable the ecosystem around observability on Node.js to thrive.

@legendecas
Copy link
Member

@Qard: I would be hesitant to absorb such a strongly opinionated API without deeply considering why each decision was made

Please note that the OpenTelemetry does define specification for language APIs, SDK interfaces, semantic conventions and OTLP as well, as part of its standard: https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification. OpenTelemetry language maintainers could leverage language-specific designs, but the general API interfaces are subject to the OpenTelemetry specifications. This allows homogenous data pipelines and data structure at the client side and defines the assumption how to consume the OpenTelemetry data.

span.setAttribute('http.request.method', method);
span.setAttribute('server.address', request.origin || '');

const url = `${request.origin}${request.path}`;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

above request.origin is defaulted to "" - here this would default to undefined.

@Flarna
Copy link
Member

Flarna commented Feb 23, 2026

I agree that there is no need to pull in the full OTel SDK/API - and for sure not in the intial PR.
I also agree that a reduced feature set/limited customization is ok as users can anyway decide to use the OTel SDK if needed.

If parts like an OTLP exporter or HTTP instrumentations are in core it would be really nice to have them available also for usage with the OTel SDK to avoid that users end up in 1.75 OTel SDKs in many node.js process.

@Qard
Copy link
Member

Qard commented Feb 23, 2026

Yes, I'm very aware of the depth of specification defined by OpenTelemetry. That's exactly what makes me hesitant about it. The specification down to cross-language API surface is heavy-handed. All that ultimately matters to implementors is that it can produce the same shape and semantics of output. The OTLP format specification alone is all that was ever really needed for implementors.

The OpenTelemetry API imposes a lot of API design decisions which are not idiomatic or even performant across all languages. It also encodes a lot of baggage into its specification by having set its specification in a time before a particular way to use it had become dominant.

OpenTelemetry is a specification built around integration, not interoperability. Meaning it takes every idea it likes and encodes it directly into one monolithic specification for the entire category of Observability rather than defining narrower specifications of different components of the architecture to be interoperable with each other to leave room for alternative interfaces to those interoperability points.

Consider HTTP specifications, we were able to define Server-Sent Events as a layer over the cluster of specifications which define HTTP because it was built to be interoperable. Despite the concepts applied by SSE not existing when HTTP was defined, it was able to layer over HTTP reasonably cleanly by simply having the flexibility and interoperability to do so. OpenTelemetry does not have this. It defines a complex end-to-end model which leaves little room for interoperability with anything not integrated into its specification directly.

Just because something has a specification doesn't mean it's good or should be encouraged that others follow it. We should examine what exactly core and the ecosystem gains from shipping that in core as opposed to continuing to let it flourish in userland where it has been doing just fine. We should try to articulate what actually is the benefit of putting it in core.

Do we want an out-of-the-box experience? If so, would it not make more sense to have a sensibly-defaulted and simplified thing? Is it to address some performance or design constraints? Perhaps those specific things should be raised to figure out how they can be addressed more directly rather than assuming that mere presence in core would resolve whatever concerns exist?

I've heard many with a vested interest in OpenTelemetry stating it as a given that it should just be in core, but I have not heard a single explanation as to why it is important that the entire API surface exists in core. That's not to say it should not be, but rather that we should seek much greater clarity on exactly what and why.

@jsumners-nr
Copy link

I've heard many with a vested interest in OpenTelemetry stating it as a given that it should just be in core, but I have not heard a single explanation as to why it is important that the entire API surface exists in core.

In my personal view: because of the reasons you stated above. The reference implementation is a real pain to work with and seems way over engineered. I would much rather see a simple implementation of the API+SDK in core as a standard so that we can rely on it to generate spans/whatever for any library that wants to generate OTEL data; right down to accepting the registration of custom exporters and the like.

@dyladan
Copy link

dyladan commented Feb 23, 2026

I realize the message prior to this may sound a bit harsh, which I don't intend it to be.

Don't worry about it. I find direct feedback a lot easier to read anyway and you're not going to hurt my feelings. Nobody is more aware than me of the shortcomings of OTel JS and I don't delude myself into thinking everyone's experience is all roses.


FWIW I also agree that the entire OpenTelemetry SDK does not need to be in core. The separation between API and SDK was meant to allow library authors to take a dependency on the API without taking a dependency on the SDK so that the end user can insert an SDK implementation that made sense for them. Some vendors have taken advantage of this in multiple languages to publish stripped-down SDKs with less features or dropping support for old/alternative runtimes to gain other advantages like performance, memory use, or deployment size. When they're built into the runtime, that distinction loses a lot of value. IMO, the vendor neutrality of OTel comes from OTLP much more than it comes from custom SDK implementations and the API/SDK split.


In my opinion it's a feature not having all that. The majority of users don't actually need most of those things, and there's very real overhead to needing to delegate across this complicated cluster of types in hot code. A minimal API with sensible defaults makes a lot more sense to me.

It seems to me like you're referring to what OTel calls the SDK not the API. More details in the next section but the terminology is important to make sense of what I'm about to say.

This is where the API/SDK split shines IMO. These features may be only useful to a subset of the community, but that isn't the same thing as saying they're useless. I think this part is where the biggest risk of ecosystem fracture comes in. If node builds in an API people will use it. If that API goes into a blackbox SDK not accessible to users like in this implementation, it will be impossible for users to define these things. Some users who do need it would then use the OTel SDK but would receive a significantly degraded experience. If there is 2-way interop between the built-in API and the OTel API we could avoid this, allowing users who need the full suite of features of the OTel SDK and users who only need the minimal built in version to share the same instrumentations and avoid too much duplication of work. In the ideal case, the upstream SDK would use the built-in versions of the basic features where possible and only implement the "power user" features.


WRT reducing API surface I assume you're referring to the SDK not the API. In OpenTelemetry parlance, the tracing API itself is not a massive API surface. The way it is implemented results in more code than necessary and if I built it again today I'd make different decisions (which we have the freedom to do here). A lot of that code focuses on cases that aren't important here (multiple different API versions in 1 process, instrumentation startup order, etc) and can be removed or reworked if there's a built-in API. Modern features like diagnostics channels that didn't exist at the time also provide some breathing room. The API itself is only a TracerProvider with a single method getTracer(): Tracer, a Tracer with a single method startSpan(): Span, and a Span with a couple methods for ending span, adding attributes, links, and span events, and setting status. Everything else (Propagators, SpanProcessors, IdGenerators, Exporters, Samplers, etc) are concepts of the SDK and are not accessible from the API.

You may be interested in a POC I wrote as an example of "if I built it again today I'd make different decisions." It uses a simplified clone of diagnostics channel (in order to be usable on web also) and results in a DRASTIC reduction in overall code. This also focuses on retaining existing API features like api.diag that I would probably not bring into a node core implementation. The full API with tracing, metrics, and logs results in a 3.5k output file. Removing metrics, logs, and diag would reduce that even further. This obviously retains the OTel-specified API/SDK split so it works differently than a built-in one likely would, but it is I hope interesting food for thought.

Trace API (with the exception of the channel and some shared types, this 218 line file is the whole thing): https://github.com/dyladan/api-2-poc/blob/main/src/trace.ts
Example how SDK consumes it (this is nowhere near a full SDK implementation): https://github.com/dyladan/api-2-poc/blob/main/sdk.ts#L70

@Qard
Copy link
Member

Qard commented Feb 23, 2026

IMO, the vendor neutrality of OTel comes from OTLP much more than it comes from custom SDK implementations and the API/SDK split.

💯 this. OTLP is the only part of OpenTelemetry that I feel is fairly universal. Even that I don't entirely agree with as it imposes some decisions like span data lives for the life of a span rather than being able to separate start event from end event, but it's universal enough to work for most use cases effectively.

No spec is perfect, some weaknesses like that are to be expected. If specs are defined to be interoperable then that also enables them to be replaceable. http1 made a bunch of mistakes that http2 tried to address, it made a bunch of its own mistakes which were then addressed by http3. The different http protocol versions are hugely different, and yet it was feasible to make such a transition because they were always designed with interoperability in mind.

It seems to me like you're referring to what OTel calls the SDK not the API.

Mainly, yes. The SDK over-complicates things with a bunch of interfaces the majority of users will never need--samplers, propagators, id generators, span processors, exporters, etc. Sensible defaults for all of those are plenty for 90% of cases. Not a fan of the whole TracerProvider -> Tracer -> Span flow to do basically anything, but that's not a significant issue. It's mainly just a usability thing that it adds a bunch of steps the majority of users should never need to care about--Node.js could simply hold a Tracer internally and just have a module-exported startSpan always use that without us needing to care about the pattern of multiple TracerProviders or Tracers.

Your minimal version looks pretty good, actually. I particularly like that it's not holding span state on the span objects. That is, in my opinion, the most significant anti-pattern in the Observability segment, which the object-based model of OpenTelemetry has not helped with avoiding. Whatever we land on in core, we should not be holding data unnecessarily and especially not making it retrievable in any way beyond just getting the OTLP output out the other end.

It uses a simplified clone of diagnostics channel (in order to be usable on web also)

This exists, btw. 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run.

Projects

None yet

Development

Successfully merging this pull request may close these issues.